Problem Statement

Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.

Data Dictionary

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: Number of years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home address ZIP code
  • Family: Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage, if any (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Importing necessary libraries

In [1]:
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user

Note:

  1. After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.

  2. On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the above code ensures that all necessary libraries and their dependencies are installed for the code in this notebook to run successfully.

In [2]:
# The data wranglers:
import pandas as pd
import numpy as np

# data visualization libraries:
import matplotlib.pyplot as plt
import seaborn as sns

# to split the data
from sklearn.model_selection import train_test_split

# to build the model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To get the scores
from sklearn.metrics import (
    accuracy_score,
    recall_score,
    precision_score,
    f1_score,
    confusion_matrix
)
# to suppress unnecessary warnings
import warnings
warnings.filterwarnings("ignore")

Loading the dataset

In [3]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [4]:
data = pd.read_csv("/content/drive/MyDrive/Data Science McCombs Class/bank decision tree/Loan_Modelling.csv")
In [5]:
df=data.copy()

Data Overview

  • Observations
  • Sanity checks
In [6]:
df.head()
Out[6]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [7]:
df.tail()
Out[7]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1
In [8]:
df.shape
Out[8]:
(5000, 14)
In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB

several of these numeric columns are really categorical; we will have to convert them to category types and create dummies through one-hot encoding

In [10]:
df.describe()
Out[10]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.00000 5000.000000 5000.000000
mean 2500.500000 45.338400 20.104600 73.774200 93169.257000 2.396400 1.937938 1.881000 56.498800 0.096000 0.104400 0.06040 0.596800 0.294000
std 1443.520003 11.463166 11.467954 46.033729 1759.455086 1.147663 1.747659 0.839869 101.713802 0.294621 0.305809 0.23825 0.490589 0.455637
min 1.000000 23.000000 -3.000000 8.000000 90005.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
25% 1250.750000 35.000000 10.000000 39.000000 91911.000000 1.000000 0.700000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
50% 2500.500000 45.000000 20.000000 64.000000 93437.000000 2.000000 1.500000 2.000000 0.000000 0.000000 0.000000 0.00000 1.000000 0.000000
75% 3750.250000 55.000000 30.000000 98.000000 94608.000000 3.000000 2.500000 3.000000 101.000000 0.000000 0.000000 0.00000 1.000000 1.000000
max 5000.000000 67.000000 43.000000 224.000000 96651.000000 4.000000 10.000000 3.000000 635.000000 1.000000 1.000000 1.00000 1.000000 1.000000

since our target variable in this classification problem is a binary dummy, its summary statistics are not very informative yet...

on average our customers are 45 years old, make about $74k a year, have a family size of 2.4, spend close to $1.9k a month on credit cards, hold at least an undergraduate degree, and carry a MEAN mortgage of about $56k, which is pulled down because most customers have no mortgage at all.

In [11]:
df.isnull().sum()
Out[11]:
0
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0

no nulls

In [12]:
df = df.drop(['ID'], axis=1)
In [13]:
# Grab the unique values for the integer columns; skip the floats since there would be too many distinct values
sanity_check = df.select_dtypes(include=['int64']).columns
for var in sanity_check:
    print(var)
    print(df[var].unique())
    print()
Age
[25 45 39 35 37 53 50 34 65 29 48 59 67 60 38 42 46 55 56 57 44 36 43 40
 30 31 51 32 61 41 28 49 47 62 58 54 33 27 66 24 52 26 64 63 23]

Experience
[ 1 19 15  9  8 13 27 24 10 39  5 23 32 41 30 14 18 21 28 31 11 16 20 35
  6 25  7 12 26 37 17  2 36 29  3 22 -1 34  0 38 40 33  4 -2 42 -3 43]

Income
[ 49  34  11 100  45  29  72  22  81 180 105 114  40 112 130 193  21  25
  63  62  43 152  83 158  48 119  35  41  18  50 121  71 141  80  84  60
 132 104  52 194   8 131 190  44 139  93 188  39 125  32  20 115  69  85
 135  12 133  19  82 109  42  78  51 113 118  64 161  94  15  74  30  38
   9  92  61  73  70 149  98 128  31  58  54 124 163  24  79 134  23  13
 138 171 168  65  10 148 159 169 144 165  59  68  91 172  55 155  53  89
  28  75 170 120  99 111  33 129 122 150 195 110 101 191 140 153 173 174
  90 179 145 200 183 182  88 160 205 164  14 175 103 108 185 204 154 102
 192 202 162 142  95 184 181 143 123 178 198 201 203 189 151 199 224 218]

ZIPCode
[91107 90089 94720 94112 91330 92121 91711 93943 93023 94710 90277 93106
 94920 91741 95054 95010 94305 91604 94015 90095 91320 95521 95064 90064
 94539 94104 94117 94801 94035 92647 95814 94114 94115 92672 94122 90019
 95616 94065 95014 91380 95747 92373 92093 94005 90245 95819 94022 90404
 93407 94523 90024 91360 95670 95123 90045 91335 93907 92007 94606 94611
 94901 92220 93305 95134 94612 92507 91730 94501 94303 94105 94550 92612
 95617 92374 94080 94608 93555 93311 94704 92717 92037 95136 94542 94143
 91775 92703 92354 92024 92831 92833 94304 90057 92130 91301 92096 92646
 92182 92131 93720 90840 95035 93010 94928 95831 91770 90007 94102 91423
 93955 94107 92834 93117 94551 94596 94025 94545 95053 90036 91125 95120
 94706 95827 90503 90250 95817 95503 93111 94132 95818 91942 90401 93524
 95133 92173 94043 92521 92122 93118 92697 94577 91345 94123 92152 91355
 94609 94306 96150 94110 94707 91326 90291 92807 95051 94085 92677 92614
 92626 94583 92103 92691 92407 90504 94002 95039 94063 94923 95023 90058
 92126 94118 90029 92806 94806 92110 94536 90623 92069 92843 92120 95605
 90740 91207 95929 93437 90630 90034 90266 95630 93657 92038 91304 92606
 92192 90745 95060 94301 92692 92101 94610 90254 94590 92028 92054 92029
 93105 91941 92346 94402 94618 94904 93077 95482 91709 91311 94509 92866
 91745 94111 94309 90073 92333 90505 94998 94086 94709 95825 90509 93108
 94588 91706 92109 92068 95841 92123 91342 90232 92634 91006 91768 90028
 92008 95112 92154 92115 92177 90640 94607 92780 90009 92518 91007 93014
 94024 90027 95207 90717 94534 94010 91614 94234 90210 95020 92870 92124
 90049 94521 95678 95045 92653 92821 90025 92835 91910 94701 91129 90071
 96651 94960 91902 90033 95621 90037 90005 93940 91109 93009 93561 95126
 94109 93107 94591 92251 92648 92709 91754 92009 96064 91103 91030 90066
 95403 91016 95348 91950 95822 94538 92056 93063 91040 92661 94061 95758
 96091 94066 94939 95138 95762 92064 94708 92106 92116 91302 90048 90405
 92325 91116 92868 90638 90747 93611 95833 91605 92675 90650 95820 90018
 93711 95973 92886 95812 91203 91105 95008 90016 90035 92129 90720 94949
 90041 95003 95192 91101 94126 90230 93101 91365 91367 91763 92660 92104
 91361 90011 90032 95354 94546 92673 95741 95351 92399 90274 94087 90044
 94131 94124 95032 90212 93109 94019 95828 90086 94555 93033 93022 91343
 91911 94803 94553 95211 90304 92084 90601 92704 92350 94705 93401 90502
 94571 95070 92735 95037 95135 94028 96003 91024 90065 95405 95370 93727
 92867 95821 94566 95125 94526 94604 96008 93065 96001 95006 90639 92630
 95307 91801 94302 91710 93950 90059 94108 94558 93933 92161 94507 94575
 95449 93403 93460 95005 93302 94040 91401 95816 92624 95131 94965 91784
 91765 90280 95422 95518 95193 92694 90275 90272 91791 92705 91773 93003
 90755 96145 94703 96094 95842 94116 90068 94970 90813 94404 94598]

Family
[4 3 1 2]

Education
[1 2 3]

Mortgage
[  0 155 104 134 111 260 163 159  97 122 193 198 285 412 153 211 207 240
 455 112 336 132 118 174 126 236 166 136 309 103 366 101 251 276 161 149
 188 116 135 244 164  81 315 140  95  89  90 105 100 282 209 249  91  98
 145 150 169 280  99  78 264 113 117 325 121 138  77 158 109 131 391  88
 129 196 617 123 167 190 248  82 402 360 392 185 419 270 148 466 175 147
 220 133 182 290 125 124 224 141 119 139 115 458 172 156 547 470 304 221
 108 179 271 378 176  76 314  87 203 180 230 137 152 485 300 272 144  94
 208 275  83 218 327 322 205 227 239  85 160 364 449  75 107  92 187 355
 106 587 214 307 263 310 127 252 170 265 177 305 372  79 301 232 289 212
 250  84 130 303 256 259 204 524 157 231 287 247 333 229 357 361 294  86
 329 142 184 442 233 215 394 475 197 228 297 128 241 437 178 428 162 234
 257 219 337 382 397 181 120 380 200 433 222 483 154 171 146 110 201 277
 268 237 102  93 354 195 194 238 226 318 342 266 114 245 341 421 359 565
 319 151 267 601 567 352 284 199  80 334 389 186 246 589 242 143 323 535
 293 398 343 255 311 446 223 262 422 192 217 168 299 505 400 165 183 326
 298 569 374 216 191 408 406 452 432 312 477 396 582 358 213 467 331 295
 235 635 385 328 522 496 415 461 344 206 368 321 296 373 292 383 427 189
 202  96 429 431 286 508 210 416 553 403 225 500 313 410 273 381 330 345
 253 258 351 353 308 278 464 509 243 173 481 281 306 577 302 405 571 581
 550 283 612 590 541]

Personal_Loan
[0 1]

Securities_Account
[1 0]

CD_Account
[0 1]

Online
[0 1]

CreditCard
[0 1]

the Experience column has some negative values (-1, -2, -3) that were probably data-entry typos

In [14]:
# Map the negative typos back to their positive counterparts
df["Experience"] = df["Experience"].replace({-1: 1, -2: 2, -3: 3})
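A quick self-contained check (a sketch with made-up values, not the notebook's data) confirms that a replacement map like this clears the negative entries:

```python
import pandas as pd

# Toy Experience column containing the same typo pattern seen above
exp = pd.Series([1, 19, -1, 30, -2, 42, -3])

# Map each negative typo to its positive counterpart
exp_fixed = exp.replace({-1: 1, -2: 2, -3: 3})

print((exp < 0).sum())        # negatives before the fix: 3
print((exp_fixed < 0).sum())  # negatives after the fix: 0
```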
In [15]:
#make some of the variables categorical
cat_cols = [
    "Education",
    "Personal_Loan",
    "Securities_Account",
    "CD_Account",
    "Online",
    "CreditCard",
    "ZIPCode",
    "Family"
]
df[cat_cols] = df[cat_cols].astype("category")
In [16]:
df.duplicated().sum()
Out[16]:
0

Exploratory Data Analysis

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?
In [17]:
sns.countplot(df["CreditCard"])
plt.show()
In [18]:
credit_card_ones = df[df['CreditCard'] == 1].shape[0]
print(f"Number of customers with CreditCard = 1: {credit_card_ones}")
Number of customers with CreditCard = 1: 1470
In [19]:
sns.countplot(df["Personal_Loan"])
plt.show()

this is our target variable, and as we can see, far fewer people took the loan. our customer base is expanding rapidly, so the patterns in this historical data should hold for new customers in this economy, as long as we aren't expanding into new markets where cultural norms and frugality differ...
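The imbalance can be quantified with `value_counts(normalize=True)`; a minimal sketch on a toy target (the 90/10 mix here is illustrative, chosen to mimic a ~10% positive class, not the dataset itself):

```python
import pandas as pd

# Toy binary target: 9 non-buyers for every buyer
target = pd.Series([0] * 90 + [1] * 10)

# Normalized counts give the class proportions directly
proportions = target.value_counts(normalize=True)
print(proportions)  # 0 -> 0.9, 1 -> 0.1
```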

In [20]:
sns.countplot(df["Securities_Account"])
plt.show()
In [21]:
sns.countplot(df["CD_Account"])
plt.show()
In [22]:
sns.countplot(df["Online"])
plt.show()
In [23]:
sns.countplot(df["Education"])
plt.show()
In [24]:
sns.countplot(df["Family"])
plt.show()
In [25]:
sns.countplot(df["ZIPCode"])
plt.show()
In [26]:
df["ZIPCode"].nunique()
Out[26]:
467

the customer base is spread across many ZIP codes, which is good: if a storm hits one area, our eggs are in many baskets.

In [27]:
# Create bins for the ZIP codes so they are easier to model
df["ZIPCodeZone"] = df["ZIPCode"].astype(str)
print(
    "Number of unique values if we take first two digits of ZIPCode: ",
    df["ZIPCodeZone"].str[0:2].nunique(),
)
df["ZIPCodeZone"] = df["ZIPCodeZone"].str[0:2]

df["ZIPCodeZone"] = df["ZIPCodeZone"].astype("category")
Number of unique values if we take first two digits of ZIPCode:  7
In [28]:
sns.countplot(df["ZIPCodeZone"])
plt.show()
In [29]:
sns.histplot(df["Mortgage"], kde=True)
plt.show()

the vast majority have no mortgage (Mortgage = 0)

In [30]:
sns.boxplot(df["Mortgage"])
plt.show()

because so many values are zero, even customers with average-priced mortgages show up as outliers in the boxplot

In [31]:
sns.histplot(df["CCAvg"], kde=True)
plt.show()

most people spend under $2k a month, but some spend as much as $10k; the distribution tapers off significantly after $3k.

In [32]:
sns.boxplot(df["CCAvg"])
plt.show()
In [33]:
sns.boxplot(df["Age"])
plt.show()
In [34]:
sns.histplot(df["Age"], kde=True)
plt.show()

the roughly uniform distribution across working ages reflects the natural human lifecycle

In [35]:
sns.boxplot(df["Experience"])
plt.show()
In [36]:
sns.histplot(df["Experience"], kde=True)
plt.show()

Experience mirrors Age, so the same lifecycle reasoning applies

In [37]:
sns.boxplot(df["Income"])
plt.show()

half of the customers make between roughly $40k and $100k; anything over about $190k is an outlier

In [38]:
sns.histplot(df["Income"], kde=True)
plt.show()

the distribution has a right tail: it peaks around $50k and tapers gently toward the outliers above $200k

Multivariate

In [39]:
sns.pairplot(df, hue="Personal_Loan")
plt.show()

people who make above $100k are a lot more likely to take out the loan. average credit-card spending over $3k a month looks like a good splitting point as well. Mortgage may matter more than Age, but less than some other features.

In [40]:
plt.figure(figsize=(15, 7))
sns.heatmap(df.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

income and credit-card spending show the only notable correlation among the numeric variables

we now need to look at our categorical variables to see how they might impact the target variable.

In [41]:
pd.crosstab(df["Education"], df["Personal_Loan"]).plot(kind="bar")
plt.show()

there is a slight increase in personal-loan uptake with more education, but nothing dramatic

In [42]:
pd.crosstab(df["Family"], df["Personal_Loan"]).plot(kind="bar")
plt.show()

nothing too interesting here

In [43]:
pd.crosstab(df["ZIPCode"], df["Personal_Loan"]).plot(kind="bar")
plt.show()

with more patience and processing power this would be very interesting, since some ZIP codes show very high loan uptake

In [44]:
pd.crosstab(df["ZIPCodeZone"], df["Personal_Loan"]).plot(kind="bar")
plt.show()

broken down by zone there is no dramatic difference, but the zones could still matter to the model after one or two splits.

In [45]:
pd.crosstab(df["CCAvg"], df["Personal_Loan"]).plot(kind="bar")
plt.show()

the count of loans looks flat, but as a proportion loans do increase with spending

In [46]:
bar_plot = pd.crosstab(df["CCAvg"], df["Personal_Loan"]).plot(kind="bar", figsize=(12, 6))  # Increase figure size

# Rotate x-axis labels for better readability
plt.xticks(rotation=90, ha='center')  # Rotate by 90 degrees for maximum space
plt.xlabel("CCAvg", fontsize=12)  # Add x-axis label with larger font
plt.ylabel("Count", fontsize=12)  # Add y-axis label
Out[46]:
Text(0, 0.5, 'Count')

upon closer inspection, loans do indeed increase as a proportion at higher spending levels

In [47]:
pd.crosstab(df["Age"], df["Personal_Loan"]).plot(kind="bar")
plt.show()

uptake is fairly constant across ages except at the tails of the distribution: very young and very old people don't seem to take our loans

In [48]:
pd.crosstab(df["Experience"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
In [49]:
pd.crosstab(df["Income"], df["Personal_Loan"]).plot(kind="bar")
plt.show()

a very similar story to spending, but even more pronounced

In [50]:
# Bin 'Income' into ranges
bins = [0, 50, 100, 150, 200, 300]  # Adjust ranges based on your data
labels = ["0-50", "50-100", "100-150", "150-200", "200+"]
df["Income_Binned"] = pd.cut(df["Income"], bins=bins, labels=labels, include_lowest=True)

# Plot the crosstab with binned income
pd.crosstab(df["Income_Binned"], df["Personal_Loan"]).plot(kind="bar", figsize=(12, 6))

# Adjust plot settings
plt.xticks(rotation=45, ha='right')
plt.xlabel("Income Range")
plt.ylabel("Count")
plt.title("Income Range vs Personal Loan")
plt.tight_layout()
plt.show()

customers making $150k-$200k are roughly 50% likely to take our loan

In [51]:
pd.crosstab(df["CD_Account"], df["Personal_Loan"]).plot(kind="bar")
plt.show()

a similar jump appears for customers who hold a CD account with us

In [52]:
pd.crosstab(df["Securities_Account"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
In [53]:
pd.crosstab(df["Online"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
In [54]:
pd.crosstab(df["CreditCard"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
In [55]:
sns.pairplot(df, hue="Personal_Loan")
plt.show()
In [56]:
plt.figure(figsize=(15, 7))
sns.heatmap(df.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

Data Preprocessing

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)

The professor said in the lectures not to worry about outliers, and that they can even be helpful in a classification problem, so I am leaving them untreated; it looks like most people in the discussion did the same.

In [57]:
#check the data types and bins you created so you can play with them
In [58]:
# We don't need the binned income column for modeling
df.drop("Income_Binned", axis=1, inplace=True)
In [59]:
# Dropping Experience (almost perfectly correlated with Age), ZIPCode (too many levels; ZIPCodeZone covers it), and Personal_Loan (our target)
X = df.drop(["Personal_Loan", "ZIPCode", "Experience"], axis=1)
Y = df["Personal_Loan"]

#creating the dummies
X = pd.get_dummies(X, columns=["ZIPCodeZone", "Family", "Education"], drop_first=True)
X = X.astype(float)

# Splitting data in train and test sets with a quarter in the test
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.25, random_state=1
)
In [60]:
#make sure everything looks like you expect it to here before you build anything
X.head()
Out[60]:
Age Income CCAvg Mortgage Securities_Account CD_Account Online CreditCard ZIPCodeZone_91 ZIPCodeZone_92 ZIPCodeZone_93 ZIPCodeZone_94 ZIPCodeZone_95 ZIPCodeZone_96 Family_2 Family_3 Family_4 Education_2 Education_3
0 25.0 49.0 1.6 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 45.0 34.0 1.5 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
2 39.0 11.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 35.0 100.0 2.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 35.0 45.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
In [61]:
#we need to make sure the test and train sets have a similar proportion of our target variable
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3750, 19)
Shape of test set :  (1250, 19)
Percentage of classes in training set:
0    0.9064
1    0.0936
Name: Personal_Loan, dtype: float64
Percentage of classes in test set:
0    0.8968
1    0.1032
Name: Personal_Loan, dtype: float64

the class proportions are similar in both sets, so we can use this split
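Had the proportions drifted apart, `train_test_split` can enforce matching class ratios via its `stratify` parameter; a self-contained sketch on synthetic data (the feature values and the ~10% positive rate are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))             # 1000 rows, 3 synthetic features
y_demo = (rng.random(1000) < 0.10).astype(int)  # ~10% positive class

# stratify=y_demo keeps the 0/1 ratio nearly identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())  # near-identical positive rates
```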

Model Building

Model Evaluation Criterion

  • A false positive (predicting a customer will take the loan when they won't) wastes some marketing effort.
  • A false negative (missing a customer who would have taken the loan) loses potential loan business.
  • Since losing a likely buyer is the costlier mistake, recall is the primary metric, with accuracy, precision, and F1 tracked alongside it.

In [62]:
# defining a function to compute the performance metrics
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)
    recall = recall_score(target, pred)
    precision = precision_score(target, pred)
    f1 = f1_score(target, pred)

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
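A quick usage sketch of the same metric computation on hand-made labels (illustrative values, not model output), to show the shape of the one-row frame the function above returns:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

# Hand-made labels and predictions: 3 TN, 1 FP, 3 TP, 1 FN
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

# Mirrors what model_performance_classification_sklearn builds internally
perf = pd.DataFrame(
    {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    },
    index=[0],
)
print(perf)  # all four metrics come out to 0.75 for this toy example
```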
In [63]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    #predict y using the model
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    #labels for matrix
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Model Building

In [64]:
#create instance of model
dtree1 = DecisionTreeClassifier(criterion="gini", random_state=1)
#fit the model to the training data
dtree1.fit(X_train, y_train)
Out[64]:
DecisionTreeClassifier(random_state=1)
In [65]:
confusion_matrix_sklearn(dtree1, X_train, y_train)
In [66]:
#grab the metrics store them for later
dtree1_train_perf = model_performance_classification_sklearn(
    dtree1, X_train, y_train
)
dtree1_train_perf
Out[66]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [67]:
feature_names = list(X_train.columns)
print(feature_names)
['Age', 'Income', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCodeZone_91', 'ZIPCodeZone_92', 'ZIPCodeZone_93', 'ZIPCodeZone_94', 'ZIPCodeZone_95', 'ZIPCodeZone_96', 'Family_2', 'Family_3', 'Family_4', 'Education_2', 'Education_3']
In [68]:
plt.figure(figsize=(20, 30))
#plot the tree
out = tree.plot_tree(
    dtree1,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
#add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

the left side of the tree develops fairly early, while the right side takes longer to reach the impurity improvements necessary for its leaf nodes. a lot of leaves have only 1 sample, and this tree is too complicated to read; together with the perfect training scores above, that points to overfitting
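One common way to curb single-sample leaves like these is to cap tree growth; a sketch on synthetic data (the `max_depth` and `min_samples_leaf` values here are illustrative, not tuned for this dataset):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced classification problem standing in for the loan data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9], random_state=1
)

# Unconstrained tree vs. one with growth limits
full = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)
pruned = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=10, random_state=1
).fit(X_demo, y_demo)

print(full.get_depth(), pruned.get_depth())  # the constrained tree is much shallower
```

Capping `max_depth` and `min_samples_leaf` trades some training accuracy for leaves that summarize many customers instead of memorizing one.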

In [69]:
print(tree.export_text(dtree1, feature_names=feature_names, show_weights=True))
|--- Income <= 113.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2741.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- Family_4 <= 0.50
|   |   |   |   |--- ZIPCodeZone_93 <= 0.50
|   |   |   |   |   |--- Age <= 37.00
|   |   |   |   |   |   |--- Age <= 34.00
|   |   |   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  34.00
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Age >  37.00
|   |   |   |   |   |   |--- weights: [33.00, 0.00] class: 0
|   |   |   |   |--- ZIPCodeZone_93 >  0.50
|   |   |   |   |   |--- Family_2 <= 0.50
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Family_2 >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Family_4 >  0.50
|   |   |   |   |--- Age <= 35.00
|   |   |   |   |   |--- CCAvg <= 2.40
|   |   |   |   |   |   |--- weights: [11.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.40
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  35.00
|   |   |   |   |   |--- CCAvg <= 2.15
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |--- CCAvg >  2.15
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  26.50
|   |   |   |   |   |--- CCAvg <= 3.90
|   |   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |--- CCAvg >  3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Family_4 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |   |   |--- ZIPCodeZone_91 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [44.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ZIPCodeZone_91 >  0.50
|   |   |   |   |   |   |   |   |   |--- Family_2 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Family_2 >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Income >  81.50
|   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |--- ZIPCodeZone_92 <= 0.50
|   |   |   |   |   |   |   |   |   |--- ZIPCodeZone_94 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- ZIPCodeZone_94 >  0.50
|   |   |   |   |   |   |   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Family_4 >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |--- ZIPCodeZone_92 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |--- Income >  83.50
|   |   |   |   |   |   |   |   |--- Age <= 30.00
|   |   |   |   |   |   |   |   |   |--- Family_2 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Family_2 >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  30.00
|   |   |   |   |   |   |   |   |   |--- weights: [16.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.90
|   |   |   |   |   |   |--- weights: [42.00, 0.00] class: 0
|   |   |   |--- CD_Account >  0.50
|   |   |   |   |--- Family_2 <= 0.50
|   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |--- Family_2 >  0.50
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |--- Income >  92.50
|   |   |   |--- Education_3 <= 0.50
|   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |--- Age <= 29.50
|   |   |   |   |   |   |   |--- Age <= 27.00
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  27.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Age >  29.50
|   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |--- Age <= 55.00
|   |   |   |   |   |   |   |   |   |--- Family_3 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Family_3 >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Age >  55.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |--- CCAvg <= 4.75
|   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |--- CCAvg >  4.75
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 10.00] class: 1
|   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |--- Education_3 >  0.50
|   |   |   |   |--- Family_2 <= 0.50
|   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 12.00] class: 1
|   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |--- Age <= 51.50
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  51.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Family_2 >  0.50
|   |   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |--- Income >  100.00
|   |   |   |   |   |   |--- ZIPCodeZone_95 <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |--- ZIPCodeZone_95 >  0.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|--- Income >  113.50
|   |--- Education_3 <= 0.50
|   |   |--- Education_2 <= 0.50
|   |   |   |--- Family_3 <= 0.50
|   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |--- weights: [414.00, 0.00] class: 0
|   |   |   |   |--- Family_4 >  0.50
|   |   |   |   |   |--- weights: [0.00, 17.00] class: 1
|   |   |   |--- Family_3 >  0.50
|   |   |   |   |--- weights: [0.00, 33.00] class: 1
|   |   |--- Education_2 >  0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 2.80
|   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |--- CCAvg >  2.80
|   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |--- CreditCard >  0.50
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 113.00] class: 1
|   |--- Education_3 >  0.50
|   |   |--- Income <= 116.50
|   |   |   |--- Online <= 0.50
|   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |--- Online >  0.50
|   |   |   |   |--- CCAvg <= 2.25
|   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |--- CCAvg >  2.25
|   |   |   |   |   |--- ZIPCodeZone_92 <= 0.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- ZIPCodeZone_92 >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |--- Income >  116.50
|   |   |   |--- weights: [0.00, 121.00] class: 1

In [70]:
# compute the Gini importance for this fully grown tree; many of these features won't even show up in our pruned trees

print(
    pd.DataFrame(
        dtree1.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.329080
Education_2         0.225952
Education_3         0.156349
Family_3            0.091799
Family_4            0.058952
CCAvg               0.050987
Age                 0.028507
CD_Account          0.020608
Family_2            0.013432
CreditCard          0.006548
ZIPCodeZone_92      0.005815
Online              0.005779
ZIPCodeZone_95      0.002515
ZIPCodeZone_94      0.002357
ZIPCodeZone_93      0.000755
ZIPCodeZone_91      0.000564
ZIPCodeZone_96      0.000000
Securities_Account  0.000000
Mortgage            0.000000
In [71]:
importances = dtree1.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

This model is practically useless apart from its feature importances. Surprisingly, average credit card spending has less impact than education and family size. Income is the most useful feature and age is somewhat important; everything else will likely not appear at all in a pruned tree.

In [72]:
confusion_matrix_sklearn(dtree1,X_test,y_test)
In [73]:
dtree1_test_perf = model_performance_classification_sklearn(dtree1,X_test,y_test)
dtree1_test_perf
Out[73]:
Accuracy Recall Precision F1
0 0.976 0.914729 0.861314 0.887218

The idea is to get a simpler model that we can use to bin customers for segmentation rather than more or less personalized marketing. This tree is too complex for our marketing budget; it doesn't do a bad job on the test data, but it's not practical to use for advertising.

I decided to build a model that tries to maximize recall because I want to see if we can find all the people who took out the loan the last time around. We don't want to neglect false positives, because those waste resources, but it's still a good idea to figure out where all the actual positives are so we can understand who NEEDS to see an ad.
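To make the recall-versus-precision tradeoff concrete, here is a minimal sketch computed from raw confusion counts. The counts are an assumption chosen to be consistent with the recall-tuned tree's reported test metrics (129 actual positives in the test set, 277 false positives); the function names are mine.

```python
# Sketch (not from the notebook): recall and precision from raw
# confusion counts. The counts used below are assumed, picked to match
# the reported test metrics of the recall-tuned tree.

def recall(tp, fn):
    """Share of actual positives we found: TP / (TP + FN)."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Share of predicted positives that are real: TP / (TP + FP)."""
    return tp / (tp + fp)

# A model tuned for recall catches every borrower (FN = 0) but drags in
# many false positives, so precision collapses.
print(recall(tp=129, fn=0))       # 1.0
print(precision(tp=129, fp=277))  # ~0.318
```

This is why recall alone cannot be the whole story: driving FN to zero is easy if we are willing to flood the predicted-positive set with FPs.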

Model Performance Improvement

In [74]:
# Define the parameters
max_depth_values = np.arange(2, 11, 2)
max_leaf_nodes_values = [10, 15, 25, 50, 75, 150, 250]
min_samples_split_values = [10, 20, 30, 50, 70]

# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0

# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:

            # Initialize the tree with the current set of parameters
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                class_weight='balanced',  # weight classes inversely to frequency to address class imbalance
                random_state=1
            )

            # Fit the model to the training data
            estimator.fit(X_train, y_train)

            # Make predictions on the training and test sets
            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)

            # Calculate recall scores
            train_recall_score = recall_score(y_train, y_train_pred)
            test_recall_score = recall_score(y_test, y_test_pred)

            # Calculate the absolute difference between training and test recall scores
            score_diff = abs(train_recall_score - test_recall_score)

            # Update the best estimator and best score if the current one has a smaller score difference
            if score_diff < best_score_diff and test_recall_score > best_test_score:
                best_score_diff = score_diff
                best_test_score = test_recall_score
                best_estimator = estimator

# Print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")
Best parameters found:
Max depth: 2
Max leaf nodes: 10
Min samples split: 10
Best test recall score: 1.0

After iterating over an initial set of parameters, with the goal of finding the model with the smallest recall gap between train and test (to satisfy the business requirement of a simpler tree), the best parameters found were: max depth 2, max leaf nodes 10, min samples split 10, with a test recall of 1.0. We still need to look at how the model does on our other performance metrics to see if we can use it...
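The selection rule in the loop above (shrink the train/test gap while also improving the test score) can be sketched on its own; the candidate names and scores below are invented for illustration.

```python
# Hypothetical sketch of the selection rule used in the search above:
# among (train_score, test_score) candidates, keep the one that both
# shrinks the generalization gap AND beats the best test score so far.

def pick_best(candidates):
    """candidates: list of (name, train_score, test_score) tuples."""
    best_name = None
    best_gap = float("inf")
    best_test = 0.0
    for name, train, test in candidates:
        gap = abs(train - test)
        # note: BOTH conditions must hold, mirroring the loop above
        if gap < best_gap and test > best_test:
            best_gap, best_test, best_name = gap, test, name
    return best_name, best_test

# invented scores for demonstration
candidates = [
    ("deep tree", 1.00, 0.91),     # overfits: large gap
    ("shallow tree", 0.98, 0.97),  # small gap, strong test score
    ("stump", 0.80, 0.79),         # small gap but worse test score
]
print(pick_best(candidates))  # ('shallow tree', 0.97)
```

Because both conditions must hold jointly, a candidate with a slightly larger gap but a much better test score gets rejected, which is worth keeping in mind when interpreting the "best" model this search returns.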

In [75]:
# Fit the best algorithm to the data.
dtree2 = best_estimator
dtree2.fit(X_train, y_train)
Out[75]:
DecisionTreeClassifier(class_weight='balanced', max_depth=2, max_leaf_nodes=10,
                       min_samples_split=10, random_state=1)
In [76]:
confusion_matrix_sklearn(dtree2, X_train, y_train)
In [77]:
dtree2_train_perf = model_performance_classification_sklearn(dtree2, X_train, y_train)
dtree2_train_perf
Out[77]:
Accuracy Recall Precision F1
0 0.789867 1.0 0.308165 0.471141
In [78]:
confusion_matrix_sklearn(dtree2, X_test, y_test)
In [79]:
dtree2_test_perf = model_performance_classification_sklearn(dtree2, X_test, y_test)

Our precision is horrible... but this tree will help us understand, simply, which groups our actual positives are in.

In [80]:
feature_names = list(X_train.columns)
In [81]:
plt.figure(figsize=(20, 20))

# plotting the decision tree
out = tree.plot_tree(
    dtree2,                         # decision tree classifier model
    feature_names=feature_names,    # list of feature names (columns) in the dataset
    filled=True,                    # fill the nodes with colors based on class
    fontsize=9,
    node_ids=False,
    class_names=None,
)

# add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)

plt.show()

If the customer makes less than ~92.5k, we look at how much they spend on their credit card; if they make more, we look at how educated they are. Using these features gives us perfect recall: we found every instance where someone purchased a loan, but a large proportion of the predicted positives are false positives. This model would waste marketing resources and create potential risk-management issues.

This is interesting, though, because it hints at which features we should be looking at.

In [82]:
# printing a text report showing the rules of a decision tree
print(
    tree.export_text(
        dtree2,
        feature_names=feature_names,
        show_weights=True
    )
)
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1440.31, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- weights: [67.85, 85.47] class: 1
|--- Income >  92.50
|   |--- Education_3 <= 0.50
|   |   |--- weights: [330.43, 1009.62] class: 1
|   |--- Education_3 >  0.50
|   |   |--- weights: [36.41, 779.91] class: 1
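The printed rules collapse into a tiny segmentation function. Below is a hand-translated sketch: the thresholds come straight from the tree above, while the function and argument names are mine.

```python
# Hand-translated from the dtree2 rules printed above: the whole tree
# reduces to an income gate plus one follow-up check per branch.
# Returns 1 = likely loan taker, 0 = unlikely.

def segment(income, cc_avg, education_advanced):
    """income and cc_avg in thousands of dollars, matching the data."""
    if income <= 92.5:
        # lower earners: only the heavy credit card spenders convert
        return 1 if cc_avg > 2.95 else 0
    # higher earners: both Education_3 branches predict class 1 here;
    # the education split only changes the leaf's purity, not the label
    return 1

print(segment(income=60, cc_avg=1.5, education_advanced=False))  # 0
print(segment(income=120, cc_avg=1.0, education_advanced=True))  # 1
```

Seeing the model this way makes it obvious why recall is perfect and precision is poor: everyone above the income threshold is flagged, regardless of any other attribute.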

The third tree will be a second pre-pruned tree that iterates over the same hyperparameter grid but picks the model with the smallest difference in F1 scores between the test and training sets.

In [83]:
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 11, 2)
max_leaf_nodes_values = [10, 15, 25, 50, 75, 150, 250]
min_samples_split_values = [10, 20, 30, 50, 70]

# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0

# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:

            # Initialize the tree with the current set of parameters
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                class_weight='balanced',
                random_state=1
            )

            # Fit the model to the training data
            estimator.fit(X_train, y_train)

            # Make predictions on the training and test sets
            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)

            # Calculate f1 scores for training and test sets
            train_f1_score = f1_score(y_train, y_train_pred)
            test_f1_score = f1_score(y_test, y_test_pred)

            # Calculate the absolute difference between training and test f1 scores
            score_diff = abs(train_f1_score - test_f1_score)

            # Update the best estimator and best score if the current one has a smaller score difference
            if score_diff < best_score_diff and test_f1_score > best_test_score:
                best_score_diff = score_diff
                best_test_score = test_f1_score
                best_estimator = estimator

print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test f1 score: {best_test_score}")
Best parameters found:
Max depth: 6
Max leaf nodes: 15
Min samples split: 70
Best test f1 score: 0.8095238095238095

Here the parameters are wildly different from our first pre-pruned tree, so we will keep the hyperparameter grid broad. And since the F1 score with the smallest train/test gap is relatively low, we will instead look for the model with the highest F1 on the test data.

In [84]:
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 7, 2)
max_leaf_nodes_values = [15, 25, 50, 75, 150, 250]
min_samples_split_values = [8, 10, 20, 30, 50, 70]

# Initialize variables to store the best model and its performance
best_estimator = None
best_test_f1_score = 0.0

# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:

            # Initialize the tree with the current set of parameters
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                class_weight='balanced',
                random_state=1
            )

            # Fit the model to the training data
            estimator.fit(X_train, y_train)

            # Make predictions on the test set
            y_test_pred = estimator.predict(X_test)

            # Calculate the F1 score for the test set
            test_f1_score = f1_score(y_test, y_test_pred)

            # Update the best estimator if the current one has a higher test F1 score
            if test_f1_score > best_test_f1_score:
                best_test_f1_score = test_f1_score
                best_estimator = estimator

print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test F1 score: {best_test_f1_score}")
Best parameters found:
Max depth: 6
Max leaf nodes: 25
Min samples split: 20
Best test F1 score: 0.8571428571428571
In [85]:
# Fit the best algorithm to the data.
dtree3 = best_estimator
dtree3.fit(X_train, y_train)
Out[85]:
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=25,
                       min_samples_split=20, random_state=1)
In [86]:
dtree3_train_perf = model_performance_classification_sklearn(dtree3,X_train,y_train)
dtree3_train_perf
Out[86]:
Accuracy Recall Precision F1
0 0.974667 0.991453 0.790909 0.879899
In [87]:
confusion_matrix_sklearn(dtree3, X_train, y_train)
In [88]:
confusion_matrix_sklearn(dtree3, X_test, y_test)
In [89]:
dtree3_test_perf = model_performance_classification_sklearn(dtree3,X_test,y_test)
dtree3_test_perf
Out[89]:
Accuracy Recall Precision F1
0 0.968 0.930233 0.794702 0.857143

The precision looks a lot better... we aren't overestimating our marketing capacity! We are still catching nearly all of the people who took the loans (93% recall), and the improved precision means we are wasting less money advertising to people who will never take our product: now close to 80% of our predicted positives are engaging!
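To put that precision number in budget terms, a back-of-the-envelope sketch: with precision p, a fraction (1 - p) of every contact goes to someone who will not convert. The $2 cost-per-contact figure is a made-up assumption; the precision is the dtree3 test value.

```python
# Rough cost illustration. The cost_per_contact value is an invented
# assumption; precision is taken from the dtree3 test results.

def wasted_spend(n_targets, cost_per_contact, precision):
    """Dollars spent contacting predicted positives who never convert."""
    return n_targets * cost_per_contact * (1 - precision)

budget_waste = wasted_spend(n_targets=1000, cost_per_contact=2.0,
                            precision=0.7947)
print(round(budget_waste, 2))  # ~$410.60 of a $2000 campaign
```

Compare that with the recall-tuned tree's ~0.32 precision, where roughly $1360 of the same $2000 campaign would go to non-converters.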

In [90]:
feature_names = list(X_train.columns)
In [91]:
plt.figure(figsize=(20, 20))

# plotting the decision tree
out = tree.plot_tree(
    dtree3,                         # decision tree classifier model
    feature_names=feature_names,    # list of feature names (columns) in the dataset
    filled=True,                    # fill the nodes with colors based on class
    fontsize=9,
    node_ids=False,
    class_names=None,
)

# add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)

plt.show()

Personally, I like these results better than the previous ones. HOWEVER, the tree is still too complicated for me to develop marketing strategies from. We have limited time and patience, so we will build one last tree in the hope of simplifying the process further while keeping the purity of the tree intact.

There are lots of ways to tune these hyperparameters depending on the business requirements and the data you are working with. I would ask clarifying questions to get a better idea of which of these would be best, but ultimately the post-pruned tree will probably be the preferred one anyway.

Our marketing department may choose this model, but I won't get full marks unless I grow one last tree.

Post Pruning

In [92]:
#create an instance of a model
clf = DecisionTreeClassifier(random_state=1)
#compute the cost complexity pruning path for the model on the training data
path = clf.cost_complexity_pruning_path(X_train, y_train)
#grab all the effective alphas from the pruning path
#grab all the effective alphas from the pruning path (abs guards against tiny negative values from floating point)
ccp_alphas = abs(path.ccp_alphas)
#find the impurities corresponding to each alpha along the pruning path
impurities = path.impurities
In [93]:
pd.DataFrame(path)
Out[93]:
ccp_alphas impurities
0 0.000000 0.000000
1 0.000233 0.000467
2 0.000248 0.000962
3 0.000253 0.001467
4 0.000253 0.002478
5 0.000261 0.003000
6 0.000267 0.003267
7 0.000300 0.004467
8 0.000331 0.005459
9 0.000356 0.005814
10 0.000400 0.007014
11 0.000427 0.007441
12 0.000427 0.007868
13 0.000427 0.009577
14 0.000444 0.010022
15 0.000452 0.010474
16 0.000489 0.010963
17 0.000569 0.011532
18 0.000697 0.014322
19 0.000711 0.015744
20 0.000775 0.017294
21 0.000891 0.018185
22 0.001203 0.019388
23 0.001231 0.020618
24 0.001551 0.023719
25 0.001725 0.025445
26 0.002178 0.027622
27 0.002571 0.030194
28 0.002670 0.032864
29 0.005519 0.038383
30 0.008709 0.047092
31 0.015084 0.062176
32 0.030670 0.123516
33 0.046162 0.169678
In [94]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set");

Below an alpha of about 0.0025 the total leaf impurity barely changes, so there is little to gain from pruning any less aggressively.
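One way to operationalize that eyeball judgment is to pick the largest alpha whose impurity step is still "flat". A sketch, using a few (alpha, impurity) pairs copied from the pruning-path table; the 0.005 flatness threshold is an arbitrary assumption.

```python
# Sketch: pick the largest alpha before the total leaf impurity starts
# taking large steps. Pairs are copied from the pruning-path table; the
# max_step flatness threshold is an arbitrary assumed cutoff.

path = [  # (ccp_alpha, total_impurity), later rows of the table
    (0.002178, 0.027622),
    (0.002571, 0.030194),
    (0.002670, 0.032864),
    (0.005519, 0.038383),
    (0.008709, 0.047092),
    (0.015084, 0.062176),
]

def largest_flat_alpha(path, max_step=0.005):
    best = path[0][0]
    for (a0, i0), (a1, i1) in zip(path, path[1:]):
        if i1 - i0 <= max_step:  # impurity step still small: keep going
            best = a1
        else:
            break
    return best

print(largest_flat_alpha(path))  # ~0.0027, close to the eyeballed 0.0025
```

This only formalizes the visual reading of the impurity-vs-alpha plot; cross-validated recall or F1 at each alpha would be a more principled selector.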

In [95]:
#initialize an empty list
clfs = []
#iterate over each ccp alpha on the pruning path
for ccp_alpha in ccp_alphas:
    # create an instance of a decision tree pruned at this alpha
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.04616191957390159
In [96]:
# remove the trivial single-node tree for bookkeeping
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

#get the number of nodes for each
node_counts = [clf.tree_.node_count for clf in clfs]
#extract the max depth for each
depth = [clf.tree_.max_depth for clf in clfs]

fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")

ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
#to avoid overlap
fig.tight_layout()

When alpha is below about 0.0025 the tree is still too complex...

In [97]:
# create an empty list for our recall scores
recall_train = []
#iterate through each of the decision tree classifiers in clfs
for clf in clfs:
    # predict labels for the training set using the current tree classifier
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)

#same for test so we can plot both and pick a good alpha
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
In [98]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend();

The best compromise between complexity and simplicity is around 0.0025 or so across the different graphs used to pick a best alpha.

I initially chose 0.0025 as my alpha, but that tree was too busy, so I switched to 0.005.

In [99]:
#create the model where recall is highest
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(random_state=1)

I had an issue where the alpha that maximized test recall was 0, the fully grown tree (with recall of 1), so I opted to choose my own alpha since the argmax gave me too complicated a tree...

In [100]:
best_alpha = 0.005
print("Best model ccp_alpha:", best_alpha)
Best model ccp_alpha: 0.005
In [101]:
estimator_2 = DecisionTreeClassifier(ccp_alpha=0.005, class_weight={0: 0.15, 1: 0.85}, random_state=1)
estimator_2.fit(X_train, y_train)
print("Best model ccp_alpha:", best_alpha)
Best model ccp_alpha: 0.005

The search automatically chose alpha = 0 because I told it to pick the tree with the best recall on the testing data... that's not really what I want, so I'm going to choose an alpha myself based on the graphs, because I can't get this criterion to select a reasonable alpha.

In [102]:
best_alpha = 0.005
print("Best model ccp_alpha:", best_alpha)
Best model ccp_alpha: 0.005
In [103]:
estimator_2 = DecisionTreeClassifier(
    ccp_alpha=best_alpha,
    class_weight={0: 0.15, 1: 0.85},
    random_state=1
)
estimator_2.fit(X_train, y_train)
Out[103]:
DecisionTreeClassifier(ccp_alpha=0.005, class_weight={0: 0.15, 1: 0.85},
                       random_state=1)
In [104]:
confusion_matrix_sklearn(estimator_2, X_train, y_train)
In [105]:
dtree4_train_perf = model_performance_classification_sklearn(estimator_2, X_train, y_train)
dtree4_train_perf
Out[105]:
Accuracy Recall Precision F1
0 0.972267 0.908832 0.815857 0.859838

I was worried that the perfectly fit tree that showed up during post-pruning might be a sign of some kind of error, but the discussion forum and ChatGPT told me not to worry.

In [106]:
confusion_matrix_sklearn(estimator_2, X_test, y_test)
In [107]:
dtree4_test_perf = model_performance_classification_sklearn(estimator_2, X_test, y_test)
dtree4_test_perf
Out[107]:
Accuracy Recall Precision F1
0 0.9632 0.860465 0.798561 0.828358
In [108]:
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator_2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)

for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

This is more complicated than I would like in a perfect world, but it is probably as simple as the dataset allows without completely sacrificing our impurity measures.

It performs very well overall, not just on recall, and allows us to develop sophisticated marketing strategies.

Model Performance Comparison and Final Model Selection

In [109]:
models_train_comp_df = pd.concat(
    [dtree1_train_perf.T, dtree2_train_perf.T, dtree3_train_perf.T, dtree4_train_perf.T], axis=1,
)

models_train_comp_df.columns = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning-Recall)", "Decision Tree (Pre-Pruning-F1)", "Decision Tree (Post-Pruning)"]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[109]:
Decision Tree (sklearn default) Decision Tree (Pre-Pruning-Recall) Decision Tree (Pre-Pruning-F1) Decision Tree (Post-Pruning)
Accuracy 1.0 0.789867 0.974667 0.972267
Recall 1.0 1.000000 0.991453 0.908832
Precision 1.0 0.308165 0.790909 0.815857
F1 1.0 0.471141 0.879899 0.859838
In [110]:
models_test_comp_df = pd.concat(
    [dtree1_test_perf.T, dtree2_test_perf.T, dtree3_test_perf.T, dtree4_test_perf.T], axis=1,
)

models_test_comp_df.columns = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning-Recall)", "Decision Tree (Pre-Pruning-F1)", "Decision Tree (Post-Pruning)"]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
Out[110]:
Decision Tree (sklearn default) Decision Tree (Pre-Pruning-Recall) Decision Tree (Pre-Pruning-F1) Decision Tree (Post-Pruning)
Accuracy 0.976000 0.778400 0.968000 0.963200
Recall 0.914729 1.000000 0.930233 0.860465
Precision 0.861314 0.317734 0.794702 0.798561
F1 0.887218 0.482243 0.857143 0.828358

Actionable Insights and Business Recommendations

  • What recommendations would you suggest to the bank?
In [111]:
sns.pairplot(df, hue="Personal_Loan")
plt.show()
In [112]:
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator_2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)

for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

If the marketing department wants something more sophisticated, they can look at the second pre-pruned tree for more in-depth analysis, but I'll be using the post-pruned tree because I think it strikes a good balance between complexity and simplicity, at least for this project.

There are a lot of nodes with low impurity. Marketing is relatively harmless: if you send someone an ad they didn't want to see, it's not the end of the world, but it can be expensive. The right decision is to market to the people in the dark orange nodes and stay away from the ones in the blue nodes.

There aren't too many people in the white (impure) leaf nodes, so you could use expert judgment to run targeted ads to them if you want to improve the margins of your bottom line.

There are two main reasons our depositors took on loans:

  1. they are lower earners who spend heavily on their credit cards, and we can get them to spend money
  2. they have a larger family

We can do payday-style loans for the people who need quick cash

or

For those not making 100k but spending almost 3k a month on their cards, they might not all be paying it off; we could offer to roll credit card debt over into a personal loan with low interest! But only if their card is with another finance corp!!! We could give these customers a low-interest introductory rate.

or

For the families: a Christmas loan they pay off over the next year. Maybe.